
How to export data with changefeeds

Last edited on June 21, 2023


    Exporting data is a crucial tool in any database user’s toolkit. In CockroachDB, the EXPORT command has long provided this essential functionality to:

    • Move data to a different data store for business analytics

    • Migrate data to a new database

    • Archive data in a platform-neutral way

    • Seed an application with data

    Say, for example, you need to export a sizable amount of JSON to seed a core data store for a streaming service. Sometimes you can pre-seed this data alongside your application with an EXPORT. But if you’re working on the scale of a company like Netflix, EXPORT commands come with some limitations that led us to explore an additional way to export data.
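
    For example, a one-time EXPORT to cloud storage might look like the following sketch (the titles table and the storage bucket are hypothetical):

    -- Hypothetical table and bucket; EXPORT writes one or more CSV files to the storage URI.
    EXPORT INTO CSV 'gs://acme-exports/titles?AUTH=implicit' FROM TABLE titles;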

    Problems with EXPORT

    First of all, EXPORTs have some configurability and integration limitations — but these are minor issues compared to two other problems:

    • EXPORT commands have a natural scale limit: above a certain table size (terabytes of data), EXPORTs are likely to run into snags that cause them to restart. Since they must restart from the beginning, an EXPORT at very high data scale is unlikely to ever finish.

    • EXPORTs have limited observability compared to jobs like schema changes and backups, since they aren’t actually jobs themselves.


    There is a solution to these problems. In CockroachDB, you can export data using changefeeds (available with CockroachDB Enterprise and in our fully managed offerings, CockroachDB Dedicated and CockroachDB Serverless).

    (CockroachDB Core users are invited to try an experimental feature: EXPERIMENTAL CHANGEFEED FOR. This statement allows users to create a new changefeed to stream row-level changes to the client indefinitely, until the underlying connection is closed or the changefeed is canceled.)
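
    As a minimal sketch (the titles table is hypothetical), a Core changefeed streams changes straight back over the SQL connection:

    -- Hypothetical table; rows are streamed to the SQL client until the connection closes or the feed is canceled.
    EXPERIMENTAL CHANGEFEED FOR titles;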

    How to export data with changefeeds

    Enterprise change data capture (CDC) provides row-level change subscriptions for downstream processing. In CockroachDB, users can create a new changefeed to stream row-level changes in a configurable format to a configurable sink such as Apache Kafka.

    -- The sink URI can point to Kafka, a webhook, cloud storage, or Google Cloud Pub/Sub.
    CREATE CHANGEFEED FOR TABLE table_name INTO 'scheme://sink-uri' WITH format = csv, initial_scan = 'only';

    The CREATE CHANGEFEED statement creates a new CockroachDB Enterprise changefeed, which targets an allowlist of tables, called “watched rows”. Every change to a watched row is emitted as a record in a configurable format (JSON, Avro, or CSV) to a configurable sink (Kafka, Google Cloud Pub/Sub, a cloud storage sink, or a webhook sink). You can create, pause, resume, alter, or cancel an Enterprise changefeed.
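
    For instance, once a changefeed is running you can manage it through the jobs interface. A minimal sketch (the job ID shown is hypothetical):

    -- List changefeed jobs and their IDs.
    SHOW CHANGEFEED JOBS;
    -- Hypothetical job ID: pause, resume, or cancel the changefeed like any other job.
    PAUSE JOB 870774961812340737;
    RESUME JOB 870774961812340737;
    CANCEL JOB 870774961812340737;
    -- A paused changefeed's watched tables can also be modified with ALTER CHANGEFEED <job_id> ADD/DROP.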

    In the base case, changefeeds send a continuous stream of a table’s row-level updates over a SQL connection or to an external system. With the initial_scan = 'only' option, however, a changefeed emits a snapshot of the table and then the job completes.

    Using changefeeds allows us to take advantage of a wider set of integrations and configurability, as well as the robust-at-scale changefeed architecture.

    Automatic data scaling

    The huge advantage: in CockroachDB changefeeds are built to scale organically with your data. Here’s how it works:

    • Changefeeds take advantage of parallel processing of data

    • Changefeeds use the jobs system, which comes with enhanced observability and scale infrastructure (see the sketch after this list)

    • Changefeeds utilize checkpointing to remain robust as data scales out
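
    As a minimal sketch of that jobs-based observability, you can inspect a changefeed’s status and checkpoint progress directly in SQL:

    -- Status and checkpoint (high-water timestamp) for each changefeed job.
    SELECT job_id, status, running_status, high_water_timestamp
    FROM [SHOW CHANGEFEED JOBS];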

    Let’s compare CREATE CHANGEFEED to a traditional EXPORT:
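
    As a rough sketch (hypothetical table and bucket), the two approaches to a one-time export look like this:

    -- Traditional export: not a job, and it restarts from the beginning on failure.
    EXPORT INTO CSV 'gs://acme-exports/titles?AUTH=implicit' FROM TABLE titles;

    -- Export-style changefeed: runs as a parallelized, checkpointed, observable job.
    CREATE CHANGEFEED FOR TABLE titles
        INTO 'gs://acme-exports/titles?AUTH=implicit'
        WITH format = csv, initial_scan = 'only';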

    For more details about the differences in configurability, check out the EXPORT and Enterprise changefeed docs.

    Changefeeds offer a great alternative to EXPORTs for users operating at high data scale. There are still a few limitations compared to the EXPORT command, but changefeeds are quickly approaching feature parity. Moreover, changefeeds offer many configuration and destination options beyond those offered by EXPORT.
